Linear Regression with Age and MutSignature

Introduction

Previously, we found the apparent deviation between robust regression results when using Matlab and R even with the same data. We intend to explore the detail which leads to the difference.

Data Preparation

Data Source

Matlab format data downloaded from supplementary data of Alexandrov et al(Clock-like mutational processes in human somatic cells), including:

  1. Table_S1.mat(sample age info)
  2. Table_S2.mat(mutSig exposure info)
options(htmlwidgets.TOJSON_ARGS = NULL)
library(heatmaply)
heatmaply(mtcars, k_row = 3, k_col = 2)

Data loading and Processing

Data loading

sam_age_tab1 <- R.matlab::readMat('F:/Table_S1.mat')
sam_exp_tab2 <- R.matlab::readMat('F:/Table_S2.mat')

Data merging

  • Retain samples with age over 0
  • Based on mutational signatures identified from the global cohort, for each individual, the exposure of signatures which didn’t exist in that sample was set 0
  • Note the special case that one sample could be diagnosed as two or more different cancer types. In our cohort, only one sample (i.e. s_266) was the case in which the miserable patent suffered from Chronic lymphocytic leukaemia and Head and neck cancer. In this situation, we only take the exposure data which appear first of intersect signatures (if any).
options(htmlwidgets.TOJSON_ARGS = list(na = 'string'))
library(tidyverse)
library(kableExtra)

# age info from tab1

sam_age_tib1 <- sam_age_tab1[c('cancer.types', 'samples', 'ages')] %>% 
  map(~unlist(.)) %>% 
  bind_rows() %>% 
  select(-cancer.types) %>% 
  filter(ages > 0)

# mutsig exp info from tab2

sam_exp_tib2 <- sam_exp_tab2 %>% 
  map(~unlist(.)) %>% 
  bind_rows()

# merge

sig_fct_lev <- str_sort(unique(sam_exp_tib2$signatures), numeric = TRUE)
age_samples <- sam_age_tib1 %>% distinct(samples)

code_sam_comb <- sam_exp_tib2 %>% 
  semi_join(sam_age_tib1, by = 'samples') %>% 
  select(-cancer.types) %>% 
  group_by(signatures) %>% 
  group_modify(~ {
    distinct(., samples, .keep_all = TRUE) %>% 
      right_join(age_samples) %>% 
      replace_na(list(mutations = 0))
  }) %>% 
  left_join(sam_age_tib1, by = 'samples')

Data descriptive statistics

zero_mut_stat <- code_sam_comb %>% 
  group_by(signatures) %>% 
  summarise(zero_pct = mean(mutations == 0), `#non_zero` = sum(mutations != 0)) %>% 
  mutate(variable = 'mutations')

code_sam_comb %>% 
  select(-samples) %>% 
  mutate(signatures = factor(signatures, levels = sig_fct_lev)) %>% 
  group_by(signatures) %>% 
  skimr::skim() %>% # descriptive statistics
  set_names(map_chr(str_split(colnames(.), '\\.'), ~ifelse(length(.) > 1, .[2], .))) %>% 
  left_join(zero_mut_stat, by = c('signatures', 'skim_variable' = 'variable')) %>% 
  select(signatures:skim_variable, mean:sd, zero_pct, `#non_zero`, p0:hist) %>% 
  mutate_if(is.numeric, ~signif(., 3)) %>% 
  DT::datatable(caption = 'Table 2 Descriptive statistics of sample age and exposure of each signature', 
                rownames = FALSE, 
                options = list(dom = 'tp')) %>% 
  DT::formatStyle('signatures', 
                  target = 'row', 
                  backgroundColor = DT::styleEqual(str_c('Signature ', c(1, 5, 23, 24)), c('green', 'green', '#D7261E', '#D7261E')))
  • Except signature 1, 2 and 5, almost all signatures contain extremely high (> 0.95) zero_pct (referred to the percentage of samples in which the mutation attributed to a particular signature is zero). As extreme cases, signature 23 and 24 only have 0 and 4 non-zero values, respectively.
  • The highly skew distribution of signature exposure is bound to compromise the effect of linear regression.

Simple Linear Regression

code_lm <- code_sam_comb %>% 
  mutate(signatures = factor(signatures, levels = sig_fct_lev)) %>%
  group_by(signatures) %>% 
  group_map(~lm(mutations ~ ages, data = .)) %>% 
  map_dfr(~broom::tidy(.)) %>% 
  filter(term != '(Intercept)') %>% 
  mutate(signatures = sig_fct_lev, p_val_adj = p.adjust(p.value, method = 'BH')) %>% 
  select(signatures, estimate:p_val_adj)

Robust Linear Regression

MASS::rlm

code_rlm_model <- code_sam_comb %>% 
  mutate(signatures = factor(signatures, levels = sig_fct_lev)) %>% 
  group_by(signatures) %>% 
  group_map(~MASS::rlm(mutations ~ ages, data = .))

# rlm p value was computed with sfsmisc::f.robftest method because MASS::rlm doesn't give p values

code_rlm_p <- code_rlm_model %>% 
  map_dbl(possibly(~sfsmisc::f.robftest(., var = 'ages')$p.value, NA))

code_rlm <- code_rlm_model %>% 
  map_dfr(~broom::tidy(.)) %>% 
  filter(term != '(Intercept)') %>% 
  mutate(signatures = sig_fct_lev, p.value = code_rlm_p, 
         p_val_adj = p.adjust(p.value, method = 'BH')) %>% 
  select(signatures, estimate:p_val_adj)

18 mutational signatures showed significance with simple linear regression, while 23 signatures were found to be significant through robust linear regression!!!
That’s unbelievable because robust regression should be more robust and stricter!

robust::lmRob

code_lmRob <- code_sam_comb %>% 
  mutate(signatures = factor(signatures, levels = sig_fct_lev)) %>% 
#  arrange(mutations) %>% 
  group_by(signatures) %>% 
  group_map(possibly(~summary(robust::lmRob(mutations ~ ages, data = .))$coefficients[2, ], set_names(c(NA, NA, NA, NA), c('Estimate', 'Std. Error', 't value', 'Pr(>|t|)')))) %>% 
  bind_rows() %>% 
  rename(estimate = Estimate, std.error = `Std. Error`, statistic = `t value`, p.value = `Pr(>|t|)`) %>% 
  mutate(signatures = sig_fct_lev, p_val_adj = p.adjust(p.value, method = 'BH')) %>% 
  select(signatures, everything())

We could find that the p value of 22 signatures was unavailable, which means robust::lmRob encounter error when a Notably, when we

robustbase::lmrob

code_lmrob <- code_sam_comb %>% 
  mutate(signatures = factor(signatures, levels = sig_fct_lev)) %>% 
  group_by(signatures) %>% 
  group_map(possibly(~summary(robustbase::lmrob(mutations ~ ages, data = .))$coefficients[2, ], set_names(c(NA, NA, NA, NA), c('Estimate', 'Std. Error', 't value', 'Pr(>|t|)')))) %>% 
  bind_rows() %>% 
  rename(estimate = Estimate, std.error = `Std. Error`, statistic = `t value`, p.value = `Pr(>|t|)`) %>% 
  mutate(signatures = sig_fct_lev, p_val_adj = p.adjust(p.value, method = 'BH')) %>% 
  select(signatures, everything())
knitr::knit_exit()